cudev: Add __shfl_down implementation for long long and unsigned long for CUDA Tookit < 9.0 #3963

Draft: wants to merge 1 commit into base: 4.x

cudawarped
Contributor

@cudawarped cudawarped commented Jun 25, 2025

Draft fix for #3962.

Support for __shfl_down on long long was not introduced until CUDA Toolkit 9.0. I don't know whether this was only software support or whether hardware support was added as well. It's a long shot, but it may be the reason the tests are failing on Compute Capability 5.3 devices.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@troelsy
Contributor

troelsy commented Jun 30, 2025

Hi, I thought I would chime in as it relates to my recent PR. The company I work for uses Jetson TX2 with CC=6.2 and CUDA Toolkit 10.2, and everything seems to work, so I took a look at it. It looks like all devices that support warp shuffle (CC≥3.0) will support shuffle with long long as long as the CUDA Toolkit version is ≥ 9.0.

As for whether it is implemented in software or hardware, I think warp shuffles are always done 32 bits at a time because the registers are limited to 32 bits, so a 64-bit type just takes two shuffles. The PTX in Compiler Explorer also indicates this: https://godbolt.org/z/nxdYcqoWe. If the PTX view doesn't show, try opening a new compiler window.
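The "two shuffles for 64-bit types" idea can be sketched on the host without a GPU. The snippet below is only an emulation under stated assumptions: `emulateShflDown32` and `emulateShflDown64` are hypothetical names that model a full 32-lane warp as an array, and they are not the PR's code or the CUDA intrinsics themselves. Lanes whose source would fall outside the warp keep their own value, which is how `__shfl_down` is documented to behave.

```cpp
#include <array>
#include <cstdint>

// Assumption: a full warp of 32 lanes, modelled as a plain array on the host.
constexpr unsigned kWarpSize = 32;

// Hypothetical stand-in for a 32-bit shuffle-down: lane i reads the value held
// by lane i + delta; if that lane does not exist, lane i keeps its own value.
std::array<std::uint32_t, kWarpSize> emulateShflDown32(
    const std::array<std::uint32_t, kWarpSize>& lanes, unsigned delta)
{
    std::array<std::uint32_t, kWarpSize> out{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        const unsigned src = lane + delta;
        out[lane] = (src < kWarpSize) ? lanes[src] : lanes[lane];
    }
    return out;
}

// 64-bit shuffle-down built from two 32-bit shuffles: split every lane's value
// into low and high words, shuffle each word separately, then recombine.
std::array<std::uint64_t, kWarpSize> emulateShflDown64(
    const std::array<std::uint64_t, kWarpSize>& lanes, unsigned delta)
{
    std::array<std::uint32_t, kWarpSize> lo{};
    std::array<std::uint32_t, kWarpSize> hi{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        lo[lane] = static_cast<std::uint32_t>(lanes[lane]);        // low 32 bits
        hi[lane] = static_cast<std::uint32_t>(lanes[lane] >> 32);  // high 32 bits
    }

    const auto loShifted = emulateShflDown32(lo, delta);
    const auto hiShifted = emulateShflDown32(hi, delta);

    std::array<std::uint64_t, kWarpSize> out{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = (static_cast<std::uint64_t>(hiShifted[lane]) << 32) | loShifted[lane];
    }
    return out;
}
```

On a real device the same split and recombine would happen per thread in registers, with the two 32-bit halves passed to the 32-bit shuffle intrinsic.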

I think the if-statement should be changed to check the CUDA Toolkit version instead. The current code will change the behavior on Jetson TX2 even though it should be supported. Does OpenCV specify a minimum version of CUDA Toolkit?

@cudawarped
Contributor Author

@troelsy The flag was just a test to try and fix the crash on CC 5.3 devices. Do you have access to CUDA Toolkit < 9.0 to test whether __shfl_down compiles there or not? Godbolt doesn't have NVCC <= 9.1.85.

@troelsy
Contributor

troelsy commented Jun 30, 2025

TX2 should be able to run CUDA Toolkit 8.0, but my department doesn't have access to the firmware, so I can't try it out.

@cudawarped
Contributor Author

@asmorkalov Do you have a machine with CUDA Toolkit < 9.0 on it to check this?

@asmorkalov
Contributor

No, unfortunately. I want to deploy something with a desktop PC, but after the 4.12 release. It's almost ready.

@cudawarped
Contributor Author

Would you like me to kill this PR or change the #define from __CUDA_ARCH__ < 700 to __CUDACC_VER_MAJOR__ < 9?
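The difference between the two conditions can be illustrated with a small host-compilable sketch. Everything named below (`HYPOTHETICAL_NEEDS_SHFL64_FALLBACK`, `fallbackByArch`, `fallbackByToolkit`) is invented for illustration and is not taken from the PR; the only identifiers from the thread are `__CUDA_ARCH__` and `__CUDACC_VER_MAJOR__`. The sketch assumes that `__CUDACC_VER_MAJOR__` is defined only when nvcc is the compiler, so a plain host compiler takes the `#else` branch.

```cpp
// Toolkit-based guard: enable the fallback only when compiling with an nvcc
// older than 9.0. With a plain host compiler the macro is undefined, so the
// fallback stays off.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ < 9)
#  define HYPOTHETICAL_NEEDS_SHFL64_FALLBACK 1
#else
#  define HYPOTHETICAL_NEEDS_SHFL64_FALLBACK 0
#endif

constexpr bool needsShfl64Fallback()
{
    return HYPOTHETICAL_NEEDS_SHFL64_FALLBACK != 0;
}

// The two candidate conditions written as ordinary functions so they can be
// compared for a given device/toolkit pair.
// cudaArch uses the __CUDA_ARCH__ convention (e.g. 620 for CC 6.2).
constexpr bool fallbackByArch(int cudaArch)
{
    return cudaArch < 700;  // mirrors "__CUDA_ARCH__ < 700"
}

constexpr bool fallbackByToolkit(int toolkitMajor)
{
    return toolkitMajor < 9;  // mirrors "__CUDACC_VER_MAJOR__ < 9"
}

// Jetson TX2 case from the thread: CC 6.2 with CUDA Toolkit 10.2.
// The architecture check would switch to the fallback; the toolkit check would not.
static_assert(fallbackByArch(620), "arch-based check selects the fallback on CC 6.2");
static_assert(!fallbackByToolkit(10), "toolkit-based check keeps the built-in on 10.x");
```

The toolkit-based condition leaves behavior unchanged on devices such as the TX2 that already work with a 9.0 or newer toolkit, which is the concern raised earlier in the conversation.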

@cudawarped cudawarped force-pushed the fix_shufl_down_on_cc_lt_70 branch from 3051f9a to af8945e June 30, 2025 12:18
@cudawarped cudawarped changed the title from "cudev: Add __shfl_down implementation for long long and unsigned long on devices of CC < 7.0" to "cudev: Add __shfl_down implementation for long long and unsigned long for CUDA Tookit < 9.0" Jun 30, 2025

3 participants